Background

Artificial intelligence (AI) is becoming increasingly popular among medical professionals. With the advent of ChatGPT and OpenEvidence, physicians and other healthcare workers have convenient access to AI platforms to assist with patient care. Within the field of oncology, the National Comprehensive Cancer Network (NCCN) guidelines remain an authoritative standard for cancer treatment.

Acute lymphoblastic leukemia (ALL) accounts for approximately one third of all childhood malignancies and is the most common cancer in children1,2. Approximately 2,500 to 3,500 cases of ALL are diagnosed each year in children, with an annual incidence of approximately 3.4 cases per 100,0001.

The goal of this study was to evaluate how ChatGPT and OpenEvidence answer diagnosis- and treatment-related questions regarding ALL.

Methods

We conducted a cross-sectional study to determine how ChatGPT and OpenEvidence answer diagnosis- and treatment-related questions regarding ALL. To do so, we created 10 questions specific to ALL and submitted them to the AI platforms under the following conditions: ChatGPT without any additional input, ChatGPT with the current NCCN guidelines uploaded, and OpenEvidence. We then reviewed the output and scored the AI-generated answers in the following categories: accuracy (scored 0-5, with 0 being a non-answer and 5 being completely accurate); completeness (scored 0-2, with 0 being incomplete and 2 being fully complete); and presence of citations (yes/no).

Results

For the ChatGPT without NCCN guidelines condition, the mean accuracy of the output was 3.6 with a standard deviation of 1.7; citations were included 10% of the time. For the ChatGPT with the NCCN guidelines uploaded condition, the mean accuracy of the output was 4.2 with a standard deviation of 1.3; citations were included 10% of the time. For the OpenEvidence condition, the mean accuracy of the output was 4.6 with a standard deviation of 1.0; citations were included 100% of the time.

For the accuracy and completeness scores, we performed analysis of variance (ANOVA) to determine whether the mean values differed significantly between conditions. We defined p < 0.05 as the threshold for statistical significance. One-way ANOVA revealed no statistically significant difference in mean accuracy between the three groups (F(2, 27) = 1.36, p = 0.27) and no statistically significant difference in mean completeness between the three groups (F(2, 27) = 0.77, p = 0.47).
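For readers who wish to reproduce this kind of analysis, a one-way ANOVA across three conditions of ten scored questions (yielding the same degrees of freedom reported above, F(2, 27)) can be sketched in plain Python. The score lists below are hypothetical placeholders for illustration only, not the study's actual data:

```python
# Illustrative one-way ANOVA on hypothetical rubric scores.
# NOTE: the score lists are placeholders, NOT the study's actual data.

def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)                      # number of groups (conditions)
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: weighted squared deviation of group means
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: squared deviations from each group's mean
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical accuracy scores (0-5 rubric), ten questions per condition
gpt_plain = [3, 4, 2, 5, 4, 3, 1, 5, 4, 5]   # ChatGPT without NCCN guidelines
gpt_nccn  = [4, 5, 3, 5, 4, 4, 2, 5, 5, 5]   # ChatGPT with NCCN guidelines
open_ev   = [5, 5, 4, 5, 5, 4, 3, 5, 5, 5]   # OpenEvidence

f_stat, df1, df2 = one_way_anova(gpt_plain, gpt_nccn, open_ev)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

In practice the same computation (plus the p-value) is available from `scipy.stats.f_oneway`; the hand-rolled version above simply makes the degrees-of-freedom arithmetic explicit.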

Discussion

The mean accuracy and completeness of OpenEvidence, ChatGPT with the current NCCN guidelines uploaded, and ChatGPT without NCCN guidelines were not statistically different from one another. Of note, no AI tool was able to accurately answer questions about preferred treatment options; further data are needed to determine whether more detailed prompting could resolve this issue. These results are underpowered, and further testing with a larger question set is needed.

This study highlights the need to assess how we use AI in healthcare. Specifically, it emphasizes the need to determine best practices both for choosing the input we provide to AI and for selecting which AI platform we use. AI is constantly evolving, so a key takeaway from this project is determining how best to use it to suit our current needs.
